In an age of increasingly large data sets, investigators in many different disciplines have turned to clustering 
as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on 
several nontrivial assumptions about the structure of data. Here we reformulate the clustering problem 
from an information theoretic perspective which avoids many of these assumptions. In particular, our 
formulation obviates the need for defining a cluster "prototype," does not require an a priori similarity 
metric, is invariant to changes in the representation of the data, and naturally captures non-linear relations. 
We apply this approach to different domains and find that it consistently produces clusters that are more 
coherent than those extracted by existing algorithms. Finally, our approach provides a way of clustering 
based on collective notions of similarity rather than the traditional pairwise measures. 

The idea that complex data can be grouped into clusters
or categories is central to our understanding of the
world, and this structure arises in many diverse contexts
(e.g., Fig. 1). In popular culture we group films or books
into genres, in business we group companies into sectors 
of the economy, in biology we group the molecular com- 
ponents of cells into functional units or pathways, and 
so on. Typically these groupings are first constructed 
by hand using specific but qualitative knowledge; e.g., 
Dell and Apple belong in the same group because they 
both make computers. The challenge of clustering is to 
ask whether these qualitative groupings can be derived 
automatically from objective, quantitative data. Is our 
intuition about sectors of the economy derivable, for ex- 
ample, from the dynamics of stock prices? Are the func- 
tional units of the cell derivable from patterns of gene 
expression under different conditions (1,2)? The litera- 
ture on clustering, even in the context of gene expression, 
is vast (3). Our goal here is not to suggest yet another 
clustering algorithm, but rather to focus on questions 
about the formulation of the clustering problem. We are 
led to an approach, grounded in information theory, that 
should have wide applicability. 

Our intuition about clustering starts with the obvious 
notion that similar elements should fall within the same 
cluster while dissimilar ones should not. But clustering 
also achieves data compression — instead of identifying 
each data point individually, we can identify points by the 
cluster to which they belong, ending up with a simpler 
and shorter description of the data. Rate-distortion the- 
ory (4,5) formulates precisely the tradeoff between these 
two considerations, searching for assignments to clusters 
such that the number of bits used to describe the data 
is minimized while the average similarity between each 
data point and its cluster representative (or prototype) is 
maximized. A well-known limitation of this formulation
(as in most approaches to clustering) is that one needs to
specify the similarity measure in advance, and quite of- 
ten this choice is made arbitrarily. Another issue, which 
attracts less attention, is that the notion of a representa- 
tive or "cluster prototype" is inherent to this formulation 
although it is not always obvious how to define this con- 
cept. Our approach provides plausible answers to both 
these concerns, with further interesting consequences. 



Theory 

Theoretical Formulation. Imagine that there are
$N$ elements ($i = 1, 2, \ldots, N$) and $N_c$ clusters
($C = 1, 2, \ldots, N_c$) and that we have assigned elements $i$
to clusters $C$ according to some probabilistic rules, $P(C|i)$,
that serve as the variables in our analysis.^1 If we reach into
a cluster and pull out elements at random, we would
like these elements to be as similar to one another as
possible. Similarity usually is defined among pairs of
elements (e.g., the closeness of points in some metric space),
but as noted below we also can construct more collective
measures of similarity among $r > 2$ elements; perhaps
surprisingly, we will see that this more general case
can be analyzed at no extra cost. Leaving aside for the
moment the question of how to measure similarity, let us
assume that computing the similarity among $r$ elements
$i_1, i_2, \ldots, i_r$ returns a similarity measure $s(i_1, i_2, \ldots, i_r)$.



^1 Conventionally, one distinguishes "hard" clustering, in which
each element is assigned to exactly one cluster, and "soft"
clustering, in which the assignments are probabilistic, described by a
conditional distribution $P(C|i)$; we consider here the more
general soft clustering, with hard clustering emerging as a limiting
case.

The average similarity among elements chosen out of a
single cluster is

$$ s(C) = \sum_{i_1=1}^{N} \cdots \sum_{i_r=1}^{N} P(i_1|C) \cdots P(i_r|C)\, s(i_1, \ldots, i_r), \qquad (1) $$

where $P(i|C)$ is the probability to find element $i$ in
cluster $C$. This average similarity corresponds to a scenario
where one chooses the elements $\{i_1, \ldots, i_r\}$ at random
out of a cluster $C$, independently of each other; other
formulations might also be plausible. From Bayes' rule
we have $P(i|C) = P(C|i)P(i)/P(C)$, where $P(C)$ is the
total probability of finding any element in cluster $C$,
$P(C) = \sum_i P(C|i)P(i)$. In many cases the elements $i$
occur with equal probability so that $P(i) = 1/N$. We
further consider this case for simplicity, although it is not
essential. The intuition about the "goodness" of the
clustering is expressed through the average similarity over all
the clusters,



$$ \langle s \rangle = \sum_{C=1}^{N_c} P(C)\, s(C). \qquad (2) $$



For the special case of pairwise "hard" clustering we obtain
$\langle s \rangle_h = \frac{1}{N} \sum_C \frac{1}{|C|} \sum_{i,j \in C} s(i,j)$,
where $|C|$ is the size of
cluster $C$. This simpler form was shown in (6) to satisfy
basic invariance and robustness criteria.
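
As an illustration of Eqs. (1) and (2) for pairwise similarities, the following minimal Python sketch computes the average similarity from a similarity matrix and a (soft or hard) assignment. It is not taken from the original analysis: the function names and the toy matrix are hypothetical, and a uniform $P(i) = 1/N$ is assumed.

```python
def cluster_marginals(p_c_given_i):
    """P(C) = (1/N) * sum_i P(C|i), assuming uniform P(i) = 1/N."""
    n = len(p_c_given_i)
    n_clusters = len(p_c_given_i[0])
    return [sum(row[c] for row in p_c_given_i) / n for c in range(n_clusters)]


def average_similarity(s, p_c_given_i):
    """<s> = sum_C P(C) s(C) for pairwise similarities (r = 2)."""
    n = len(s)
    total = 0.0
    for c, p_c in enumerate(cluster_marginals(p_c_given_i)):
        if p_c == 0.0:
            continue  # an empty cluster contributes nothing
        # Bayes' rule with P(i) = 1/N: P(i|C) = P(C|i) / (N * P(C))
        p_i_given_c = [p_c_given_i[i][c] / (n * p_c) for i in range(n)]
        s_c = sum(p_i_given_c[i] * p_i_given_c[j] * s[i][j]
                  for i in range(n) for j in range(n))
        total += p_c * s_c
    return total


# Toy data (made up): four elements, hard clusters {0, 1} and {2, 3}.
s = [[1.0, 0.8, 0.1, 0.0],
     [0.8, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.6],
     [0.0, 0.1, 0.6, 1.0]]
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
avg = average_similarity(s, hard)

# Hard-clustering shortcut: (1/N) * sum_C (1/|C|) * sum_{i,j in C} s(i,j)
shortcut = ((s[0][0] + s[0][1] + s[1][0] + s[1][1]) / 2
            + (s[2][2] + s[2][3] + s[3][2] + s[3][3]) / 2) / 4
```

For this hard assignment the general expression and the hard-clustering shortcut agree, which is a quick consistency check on both formulas.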

The task then is to choose the assignment rules $P(C|i)$
that maximize $\langle s \rangle$, while, as in rate-distortion theory,
simultaneously compressing our description of the data as
much as possible. To implement this intuition we
maximize $\langle s \rangle$ while constraining the information carried by
the cluster identities (5),

$$ I(C;i) = \frac{1}{N} \sum_{i=1}^{N} \sum_{C=1}^{N_c} P(C|i) \log\left[ \frac{P(C|i)}{P(C)} \right]. \qquad (3) $$

Thus, our mathematical formulation of the intuitive
clustering problem is to maximize the functional



$$ \mathcal{F} = \langle s \rangle - T\, I(C;i), \qquad (4) $$



where the Lagrange multiplier $T$ enforces the constraint
on $I(C;i)$. Notice that, as in other formulations of the
clustering problem, $\mathcal{F}$ resembles the free energy in
statistical mechanics, where the temperature $T$ specifies the
tradeoff between energy-like and entropy-like terms.
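
To make the tradeoff concrete, here is a small Python sketch of the information term and of the functional in Eq. (4). It assumes uniform $P(i) = 1/N$ and measures information in bits (a different logarithm base only rescales $T$). The function names are hypothetical, and the value 0.85 used for the average similarity is a made-up number for illustration.

```python
import math


def assignment_information(p_c_given_i):
    """I(C;i) in bits, assuming uniform P(i) = 1/N."""
    n = len(p_c_given_i)
    n_clusters = len(p_c_given_i[0])
    p_c = [sum(row[c] for row in p_c_given_i) / n for c in range(n_clusters)]
    info = 0.0
    for row in p_c_given_i:
        for c, p in enumerate(row):
            if p > 0.0:  # the term 0 * log(0) is taken as 0
                info += (p / n) * math.log2(p / p_c[c])
    return info


def objective(avg_similarity, information, temperature):
    """F = <s> - T * I(C;i)."""
    return avg_similarity - temperature * information


hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5] for _ in range(4)]

info_hard = assignment_information(hard)           # two equal hard clusters
info_flat = assignment_information(uninformative)  # assignments ignore i
f_hard = objective(0.85, info_hard, temperature=0.2)
```

A hard split into two equally sized clusters carries one bit about the element identity, while assignments that do not depend on $i$ carry none; the temperature sets how much average similarity one bit must buy.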

This formulation is intimately related to conventional 
rate-distortion theory. In rate-distortion clustering one 
is given a fixed number of bits with which to describe 
the data, and the goal is to use these bits so as to 
minimize the distortion between the data elements and 
some representatives of these data. In practice the bits 
specify membership in a cluster, and the representatives 
are prototypical or average patterns in each cluster. 
Here we see that we can formulate a similar tradeoff
with no need to introduce the notion of a representative
or average; instead, we measure directly the similarity of
elements within each cluster; moreover, we can consider
collective rather than pairwise measures of similarity. A
more rigorous treatment detailing the relation between
Eq. (4) and the conventional rate-distortion functional
will be presented elsewhere.

Optimal Solution. In general it is not possible to
find an explicit solution for the $P(C|i)$ that maximize
$\mathcal{F}$. However, if we assume that $\mathcal{F}$ is differentiable with
respect to the variables $P(C|i)$, equating the
derivative to zero yields after some algebra a set of implicit,
self-consistent equations that any optimal solution must
obey:

$$ P(C|i) = \frac{P(C)}{Z(i;T)} \exp\left\{ \frac{1}{T} \left[ r\, s(C;i) - (r-1)\, s(C) \right] \right\}, \qquad (5) $$

where $Z(i;T)$ is a normalization constant and $s(C;i)$ is
the expected similarity between $i$ and $r - 1$ members of
cluster $C$,

$$ s(C;i) = \sum_{i_1=1}^{N} \cdots \sum_{i_{r-1}=1}^{N} P(i_1|C) \cdots P(i_{r-1}|C)\, s(i_1, \ldots, i_{r-1}, i). \qquad (6) $$
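
The "some algebra" leading to the self-consistent equations can be sketched as follows. This is our reconstruction rather than the authors' own derivation. It assumes uniform $P(i) = 1/N$, a similarity measure that is symmetric in its arguments, natural logarithms in $I(C;i)$, and Lagrange multipliers $\lambda_i$ (introduced here) for the normalization $\sum_C P(C|i) = 1$.

```latex
\begin{align*}
P(i|C) &= \frac{P(C|i)}{N\,P(C)}, \qquad
\frac{\partial P(C)}{\partial P(C|i)} = \frac{1}{N}, \\
\langle s \rangle &= \sum_C \frac{P(C)^{1-r}}{N^{r}}
  \sum_{i_1,\ldots,i_r} P(C|i_1)\cdots P(C|i_r)\, s(i_1,\ldots,i_r), \\
\frac{\partial \langle s \rangle}{\partial P(C|i)}
  &= \frac{1}{N}\left[\, r\, s(C;i) - (r-1)\, s(C) \right], \\
\frac{\partial I(C;i)}{\partial P(C|i)}
  &= \frac{1}{N} \ln \frac{P(C|i)}{P(C)}, \\
0 &= \frac{1}{N}\left[\, r\, s(C;i) - (r-1)\, s(C) \right]
   - \frac{T}{N} \ln \frac{P(C|i)}{P(C)} - \lambda_i \\
\Longrightarrow \quad
P(C|i) &= \frac{P(C)}{Z(i;T)}
  \exp\left\{ \frac{1}{T}\left[\, r\, s(C;i) - (r-1)\, s(C) \right] \right\},
  \qquad Z(i;T) = e^{N\lambda_i/T}.
\end{align*}
```

The factor $r$ arises because $P(C|i)$ appears in each of the $r$ positions of the product, and the term $-(r-1)\,s(C)$ comes from differentiating the prefactor $P(C)^{1-r}$.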



The derivation of these equations from the optimization
of $\mathcal{F}$ is reminiscent of the derivation of the rate-distortion
(5) or information bottleneck (7) equations. This simple
form is valid when the similarity measure is invariant
under permutations of the arguments. In the more general
case we have

$$ P(C|i) = \frac{P(C)}{Z(i;T)} \exp\left\{ \frac{1}{T} \left[ \sum_{r'=1}^{r} s(C; i^{(r')}) - (r-1)\, s(C) \right] \right\}, \qquad (7) $$

where $s(C; i^{(r')})$ is the expected similarity between $i$ and
$r - 1$ members of cluster $C$ when $i$ is the $r'$-th argument of
$s$.

An obvious feature of Eq. (5) is that element $i$ should
be assigned to cluster $C$ with higher probability if it is
more similar to the other elements in the cluster. Less
obvious is that this similarity has to be weighed against
the mean similarity among all the elements in the cluster.
Thus, our approach automatically embodies the intuitive
principle that "tightly knit" groups are more difficult to
join. We emphasize that we did not explicitly impose this
property, but rather it emerges directly from the
variational principle of maximizing $\mathcal{F}$; most other clustering
methods do not capture this intuition.

The probability $P(C|i)$ in Eq. (5) has the form of a
Boltzmann distribution, and increasing similarity among
elements of a cluster plays the role of lowering the
energy; the temperature $T$ sets the scale for converting
similarity differences into probabilities. As we lower this
temperature there is a sequence of "phase transitions"
to solutions with more distinct clusters that achieve
greater mean similarity in each cluster (8). For a fixed
number of clusters, reducing the temperature yields
more deterministic $P(C|i)$ assignments.

Algorithm. Although Eq. (5) is an implicit set of
equations, we can turn this self-consistency condition
into an iterative algorithm that finds an explicit
numerical solution for $P(C|i)$ that corresponds to a (perhaps
local) maximum of $\mathcal{F}$. Fig. 2 presents pseudo-code
of the algorithm for the case $r = 2$. Extending the
algorithm to the general case of more than pairwise
relations ($r > 2$) is straightforward. In principle we
repeat this procedure for different initializations and
choose the solution which maximizes $\mathcal{F} = \langle s \rangle - T\, I(C;i)$.
We refer to the algorithm described here as Iclust. We
emphasize that we utilize this algorithm mainly because
it emerges directly out of the theoretical analysis.
Other procedures that aim to optimize the same target
functional are certainly plausible and we expect future
research to elucidate the potential (dis)advantages of
the different alternatives.
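
The iteration can be sketched in Python for the pairwise case as follows. This is a minimal illustration under stated assumptions, not the authors' Iclust implementation. The function name, the `max_sweeps` safeguard, the fixed random seed and the toy similarity matrix are all hypothetical, $P(i) = 1/N$ is taken as uniform, and each element is updated in turn using the current assignments of all the others.

```python
import math
import random


def iclust_pairwise(s, n_clusters, temperature, eps=1e-6, max_sweeps=500, seed=0):
    """Iterate the r = 2 self-consistent update; returns P(C|i) as N rows."""
    rng = random.Random(seed)
    n = len(s)
    # Random normalized initialization of P(C|i).
    p = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        norm = sum(row)
        p.append([x / norm for x in row])

    for _ in range(max_sweeps):
        largest_change = 0.0
        for i in range(n):
            log_w = []
            for c in range(n_clusters):
                p_c = sum(p[j][c] for j in range(n)) / n       # P(C)
                if p_c < 1e-300:
                    log_w.append(-math.inf)                    # empty cluster
                    continue
                q = [p[j][c] / (n * p_c) for j in range(n)]    # P(j|C)
                s_ci = sum(q[j] * s[i][j] for j in range(n))   # s(C;i)
                s_c = sum(q[j] * q[k] * s[j][k]                # s(C)
                          for j in range(n) for k in range(n))
                # log of P(C) * exp{[2 s(C;i) - s(C)] / T}
                log_w.append(math.log(p_c) + (2.0 * s_ci - s_c) / temperature)
            top = max(log_w)                                   # avoid overflow
            w = [math.exp(x - top) for x in log_w]
            norm = sum(w)
            new_row = [x / norm for x in w]
            largest_change = max(largest_change,
                                 max(abs(a - b) for a, b in zip(new_row, p[i])))
            p[i] = new_row
        if largest_change < eps:
            break
    return p


# Toy similarity matrix (made up): two blocks of three mutually similar elements.
toy = [[1.0 if (a < 3) == (b < 3) else 0.0 for b in range(6)] for a in range(6)]
assignments = iclust_pairwise(toy, n_clusters=2, temperature=0.1)
labels = [max(range(2), key=lambda c: row[c]) for row in assignments]
```

Because the procedure may converge to a local maximum, in practice one would rerun it with different seeds and keep the solution with the largest value of the functional, as described above.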

Information as a Similarity Measure. In formulating
the clustering problem as the optimization of $\mathcal{F}$
we have used, as in rate-distortion theory, the generality
of information theory to provide a natural measure
for the cost of dividing the data into more clusters, but
the similarity measure remains arbitrary and commonly
is believed to be problem specific. Is it possible to use
information theory to address this issue as well? To
be concrete, consider the case where the elements $i$ are
genes and we are trying to measure the relation between
gene expression patterns across a variety of conditions
$\mu = 1, 2, \ldots, M$; gene $i$ has expression level $e_i(\mu)$ under
condition $\mu$. We imagine that there is some real
distribution of conditions that cells encounter during their
lifetime, and an experiment with a finite set of conditions
provides samples out of this distribution. Then, for each
gene we can define the probability density of expression
levels,

$$ P_i(e) = \frac{1}{M} \sum_{\mu=1}^{M} \delta(e - e_i(\mu)), \qquad (8) $$

which should become smooth as $M \to \infty$. Similarly we
can define the joint probability density for the expression
levels of $r$ genes $i_1, i_2, \ldots, i_r$,

$$ P_{i_1 i_2 \cdots i_r}(e_1, \ldots, e_r) = \frac{1}{M} \sum_{\mu=1}^{M} \delta(e_1 - e_{i_1}(\mu)) \cdots \delta(e_r - e_{i_r}(\mu)). \qquad (9) $$

Given the joint distributions of expression levels, infor- 
mation theory provides natural measures of the relations 
among genes. For r = 2, we can identify the relatedness 
of genes i and j with the mutual information between the 
expression levels, 

$$ s(i,j) = I_{ij} = \int de_1 \int de_2\, P_{ij}(e_1, e_2) \log_2\left[ \frac{P_{ij}(e_1, e_2)}{P_i(e_1) P_j(e_2)} \right] \ \text{bits}. \qquad (10) $$

This measure is naturally extended to the multi- 
information among multiple variables (9), or genes: 



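$$ I^{(r)}_{i_1 i_2 \cdots i_r} = \int d^r e\; P_{i_1 i_2 \cdots i_r}(e_1, \ldots, e_r) \log_2\left[ \frac{P_{i_1 i_2 \cdots i_r}(e_1, \ldots, e_r)}{P_{i_1}(e_1) P_{i_2}(e_2) \cdots P_{i_r}(e_r)} \right] \ \text{bits}. \qquad (11) $$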



We recall that the mutual information is the unique 
measure of relatedness between a pair of variables that 
obeys several simple and desirable requirements indepen- 
dent of assumptions about the form of the underlying 
probability distributions (4). In particular, the mutual 
(and multi-) information are independent of invertible 
transformations on the individual variables. For exam- 
ple, the mutual information between the expression levels 
of two genes is identical to the mutual information be- 
tween the log of the expression levels: there is no need 
to find the "right" variables with which to represent the 
data. The absolute scale of the information measure also 
has a clear meaning. For example, if two genes share 
more than one bit of information then the underlying bi- 
ological mechanisms must be more subtle than just turn- 
ing expression on and off. In addition, the mutual infor- 
mation reflects any type of dependence among variables 
while ordinary correlation measures typically ignore non- 
linear dependences. 
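
The invariance and nonlinearity properties can be checked numerically with a naive estimator. The Python sketch below is not the "direct" estimation method used in this work: it is a simple plug-in estimate on rank-based (equal-population) bins, which is our own simplifying assumption and is biased for small samples. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def quantile_bins(values, n_bins):
    """Map each value to an equal-population bin index based on its rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    bins = [0] * len(values)
    for rank, k in enumerate(order):
        bins[k] = rank * n_bins // len(values)
    return bins


def plugin_mutual_information(x, y, n_bins=4):
    """Naive plug-in estimate of I(x;y) in bits from paired samples."""
    bx, by = quantile_bins(x, n_bins), quantile_bins(y, n_bins)
    m = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    # sum over occupied cells of p(a,b) * log2[ p(a,b) / (p(a) p(b)) ]
    return sum((n / m) * math.log2(n * m / (px[a] * py[b]))
               for (a, b), n in joint.items())


# Toy data (made up): y depends on x through a symmetric, non-monotonic map,
# so the linear (Pearson) correlation between x and y vanishes.
x = [float(v) for v in range(1, 17)]
y = [(v - 8.5) ** 2 for v in x]

mi_raw = plugin_mutual_information(x, y)
mi_log = plugin_mutual_information([math.log(v) for v in x], y)
```

With rank-based bins the estimate is unchanged when `x` is replaced by its logarithm, because a monotonic transformation preserves ranks, and the estimate is clearly positive even though the linear correlation is zero.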

While these theoretical advantages are well known, in
practice information theoretic quantities are notoriously
difficult to estimate from finite data. For example,
although the distributions in Eqs. (8,9) become smooth
in the limit of many samples ($M \to \infty$), with a
finite amount of data one needs to regularize or discretize
the distributions, and this could introduce artifacts.
Although there is no completely general solution to these
problems, we have found that in practice the difficulties
are not as serious as one might have expected. Using an
adaptation of the "direct" estimation method originally
developed in the analysis of neural coding (10), we have
found that one can obtain reliable estimates of mutual
(and sometimes multi-) information values for a variety
of data types, including gene expression data (11); see the
supplementary material for details. In particular,
experiments which explore gene expression levels under > 100
conditions are sufficient to estimate the mutual
information between pairs of genes with an accuracy of ~ 0.1
bits.^2



^2 It should be noted that in applications where there is a natural
similarity measure it might be advantageous to use this measure
directly. Furthermore, in situations where the number of
observations is not sufficient for non-parametric estimates of the
information relations, other heuristic similarity measures should be
employed or one could use parametric models for the underlying
distributions. Notice, though, that these alternative measures
can be incorporated into the algorithm in Fig. 2.

To summarize, we have suggested a purely information
theoretic approach to clustering and categorization:
relatedness among elements is defined by the
mutual (or multi-) information, and optimal clustering
is defined as the best tradeoff between maximizing
this average relatedness within clusters and minimizing
the number of bits required to describe the data.
The result is a formulation of clustering that trades
bits of similarity against bits of descriptive power, with
no further assumptions. A freely available web
implementation of the clustering algorithm and the mutual
information estimation procedure can be found at
http://www.genomics.princeton.edu/biophysics-theory.



Results 

Gene Expression. As a first test case we consider 
experiments on the response of gene expression levels in 
yeast to various forms of environmental stress (12). Pre- 
vious analysis identified a group of ~ 300 stress-induced 
and ~ 600 stress-repressed genes with "nearly identi- 
cal but opposite patterns of expression in response to the 
environmental shifts" (13), and these genes were termed 
the environmental stress response (ESR) module. In fact, 
based on this observation, these genes were excluded from 
recent further analysis of the entire yeast genome (14). 
Nonetheless, as we shall see next, our approach automat- 
ically reveals further rich and meaningful substructure in 
these data. 

As seen in Fig. 3A, differences in expression profiles
within the ESR module indeed are relatively subtle.
However, when considering the mutual information relations
(Fig. 3B) a relatively clear structure emerges. We
have solved our clustering problem for $r = 2$ and various
numbers of clusters and temperatures. The resulting
concave tradeoff curves between $\langle s \rangle$ and $I(C;i)$ are shown
in Fig. 4A. We emphasize that we generate not a single
solution, but a whole family of solutions describing structure
at different levels of complexity. With the number of
clusters fixed, $\langle s \rangle$ gradually saturates as the temperature
is lowered and the constraint on $I(C;i)$ is relaxed. For
the sake of brevity we focused our analysis on the four
solutions for which the saturation of $\langle s \rangle$ is relatively clear
($1/T = 25$). At this temperature, ~ 85% of the genes
have nearly deterministic assignments to one of the
clusters [$P(C|i) > 0.9$ for a particular $C$]. As an illustration,
three of the twenty clusters found at this temperature are
in fact the clusters presented in Fig. 1.

We have assessed the biological significance of our
results by considering the distribution of gene annotations
across the clusters and estimating the corresponding
clusters' coherence^3 with respect to all three Gene



^3 Specifically, the coherence of a cluster (14) is defined as the
percentage of elements in this cluster which are annotated by an



Ontologies (15). Almost all of our clusters were signifi- 
cantly enriched in particular annotations. We compared 
our performance to 18 different conventional clustering 
algorithms that are routinely applied to this data type 
(16). We employed the clustering software available at
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/
to implement the conventional algorithms. In Fig. 5
we see that our clusters obtained the highest average 
coherence, typically by a significant margin. Moreover, 
even when the competing algorithms cluster the log2 
of expression (ratio) profiles — a common regularization 
used in this application with no formal justification — 
our results are comparable or superior to all of the 
alternatives. 
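
The coherence measure used in this comparison (the percentage of cluster members carrying an annotation that is significantly enriched in the cluster, with a Bonferroni correction) can be sketched in Python as follows. The specific enrichment test is not stated in this section, so the hypergeometric tail used here is an assumption. The function names and the toy annotations are hypothetical.

```python
from math import comb


def hypergeom_tail(k, n_cluster, n_annotated, n_total):
    """P(X >= k) when n_cluster items are drawn without replacement from
    n_total items, of which n_annotated carry the annotation."""
    upper = min(n_cluster, n_annotated)
    hits = sum(comb(n_annotated, x) * comb(n_total - n_annotated, n_cluster - x)
               for x in range(k, upper + 1))
    return hits / comb(n_total, n_cluster)


def cluster_coherence(cluster, annotations, population, alpha=0.05):
    """Percentage of cluster members with at least one annotation that is
    significantly enriched in the cluster (Bonferroni-corrected)."""
    labels = {a for item in population for a in annotations.get(item, ())}
    enriched = set()
    for label in labels:
        k = sum(label in annotations.get(item, ()) for item in cluster)
        if k == 0:
            continue
        n_annotated = sum(label in annotations.get(item, ()) for item in population)
        p_value = hypergeom_tail(k, len(cluster), n_annotated, len(population))
        if p_value * len(labels) < alpha:  # Bonferroni over all tested labels
            enriched.add(label)
    covered = sum(any(a in enriched for a in annotations.get(item, ()))
                  for item in cluster)
    return 100.0 * covered / len(cluster)


# Toy data (made up): 20 items, two annotations carried by five items each.
population = list(range(20))
annotations = {i: ("ribosome",) for i in range(5)}
annotations.update({i: ("stress",) for i in range(5, 10)})
cluster = [0, 1, 2, 3, 10]

coherence = cluster_coherence(cluster, annotations, population)
```

Averaging this quantity over the clusters of a solution gives the kind of per-algorithm summary compared in Fig. 5, although the exact statistical procedure used for the published results is described only in the supplementary material.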

Instead of imposing a hierarchical structure on the
data, as done in many popular clustering algorithms, here
we directly examine the relations between solutions at
different numbers of clusters that were found
independently.^4 In Fig. 6 we see that an approximate hierarchy
emerges as a result rather than as an implicit
assumption, where some functional modules (e.g., the "ribosome
cluster", $C_{18}$) are better preserved than others.

Our attention is drawn also to the cluster $C_7$, which
is found repeatedly at different numbers of clusters.
Specifically, at the solution with 20 clusters, among the
114 repressed genes in $C_7$, 69 have an uncharacterized
molecular function; this level of concentration has a 
probability of ^ 10^^^ to have arisen by chance. One 
might have suspected that almost every process in the 
cell has a few components that have not been identified, 
and hence that as these processes are regulated there 
would be a handful of unknown genes that are regulated 
in concert with many genes of known function. At least 
for this cluster, our results indicate a different scenario 
where a significant portion of tightly co-expressed genes 
remain uncharacterized to date. 

Stock Prices. To emphasize the generality of our
approach we consider a very different data set, the
day-to-day fractional changes in price of the stocks in the
Standard and Poor's (S&P) 500 list (available at
http://www.standardandpoors.com), during the trading
days of 2003. We cluster these data exactly as in our
analysis of gene expression data. The resulting tradeoff
curves are shown in Fig. 4B, and again we focus on the
four solutions where $\langle s \rangle$ already saturates.

To determine the coherence of the ensuing clusters we 



annotation that was found to be significantly enriched in this
cluster (P-value < 0.05, with the Bonferroni correction for
multiple hypotheses). See the supplementary material for a detailed
discussion regarding the statistical validation of our results.
^4 In standard agglomerative or hierarchical clustering one starts
with the most detailed partition of singleton clusters and obtains
new solutions through merging of clusters. Consequently, one
must end up with a tree-like hierarchy of clustering partitions,
regardless of whether the data structure actually supports this
description.

used the Global Industry Classification Standard (GICS)
(available at http://wrds.wharton.upenn.edu), which
classifies companies at four different levels: sector, industry
group, industry, and sub-industry. Thus each company
is assigned four annotations, which are organized in a
hierarchical tree, somewhat similar to the Gene Ontology
hierarchical annotation (15).

As before, our average coherence performance is
comparable to or superior to all the other 18 clustering
algorithms we examined (Fig. 5). Almost all our clusters,
at various levels of $N_c$, exhibit a surprisingly high degree
of coherence with respect to the "functional labels" that
correspond to the different (sub) sectors of the economy.
The four independent solutions, at $N_c = \{5, 10, 15, 20\}$
and $1/T = 35$, naturally form an approximate hierarchy
(see Fig. 10 of Supporting Material).

We have analyzed in detail the results for $N_c = 20$
and $1/T = 35$, where selections from three of the derived
clusters are shown in Fig. 1. Eight of the clusters are
found to be perfectly (100%) coherent, capturing subtle
differences between industrial sectors. For example, two
of the perfectly coherent clusters segregate companies
into either investment banking and asset management
(e.g., Merrill Lynch) or commercial regional banks (e.g.,
PNC). Even in clusters with less than perfect coherence
we are able to observe and explain relationships between
intra-cluster companies above and beyond what the
annotations may suggest. For example, one cluster is
enriched with "Hotel Resorts and Cruise Line" companies
at a coherence level of 30%. Nonetheless, the remaining
companies in this cluster seem also to be tied with the
tourism industry, like the Walt Disney Co., banks which
specialize in credit card issuing and so on.

Movie Ratings. Finally, we consider a third test case
of yet another different nature: movie ratings provided
by more than 70,000 viewers (the EachMovie database,
see http://www.research.digital.com/SRC/eachmovie/).
Unlike the previous cases, the data here are already
naturally quantized since only six possible ratings were
permitted.

We proceed as before to cluster the 500 movies that 
received the maximal number of votes. The resulting 
tradeoff curves are presented in Fig. 4C. Few clusters
are preserved amongst the solutions at different values
of $N_c$, suggesting that a hierarchical structure may not be
a natural representation of the data. Cluster coherence
was determined with respect to the genre labels provided 
in the database: action, animation, art-foreign, classic, 
comedy, drama, family, horror, romance, and thriller. 
Fig. 5 demonstrates that our results are superior to all 
the other 18 standard clustering algorithms. 

We have analyzed in detail the results for $N_c = 20$ and
$1/T = 40$ where, once again, selections from three of the
derived clusters are shown in Fig. 1. The clusters indeed
reflect the various genres, but also seem to capture subtle
distinctions between sets of movies belonging to the same
genre. For example, two of the clusters are both enriched
in the action genre, but one group consists mainly of
science-fiction movies and the other consists of movies in
contemporary settings.

Details of all three applications are given in a
separate technical report, deposited on ArXiv as
http://arxiv.org/abs/q-bio.QM/0511042



Discussion 

Measuring the coherence of clusters corresponds to 
asking if the automatic, objective procedure embodied in 
our optimization principle does indeed recover the intu- 
itive labeling constructed by human hands. Our success 
in recovering functional categories in different systems us- 
ing exactly the same principle and practical algorithm is 
encouraging. It should be emphasized that our approach 
is not a model of each system and that there is no need 
for making data-dependent decisions in the representa- 
tion of the data, nor in the definition of similarity. 

Most clustering algorithms embody (perhaps
implicitly) different models of the underlying
statistical structure.^5 In principle, more accurate models
should lead to more meaningful clusters. However, 
the question of how to construct an accurate model 
obviously is quite involved, raising further issues that 
often are addressed arbitrarily before the cluster analysis 
begins. Moreover, as is clear from Fig. 5, an algorithm 
or model which is successful in one data type might fail 
completely in a different domain; even in the context 
of gene expression, successful analysis of data taken 
under one set of conditions does not necessarily imply 
success in a different set of conditions, even for the same 
organism. Our use of information theory allows us to 
capture the relatedness of different patterns independent 
of assumptions about the nature of this relatedness. 
Correspondingly, we have a single approach which 
achieves high performance across different domains. 

Finally, our approach can succeed where other methods 
would fail qualitatively. Conventional algorithms search 
for linear or approximately linear relations among the 
different variables, while our information theoretic ap- 
proach is responsive to any type of dependencies, includ- 
ing strongly nonlinear structures. In addition, while the 
cluster analysis literature has focused thus far on pairwise 
relations and similarity measures, our approach sets a 
sound theoretical framework for analyzing complex data 
based on higher order relations. Indeed, it was recently 
demonstrated, both in principle (17) and in practice (18), 
that in some situations the data structure is obscured 
at the pairwise level, but clearly manifests itself only at 



^5 For example, the K-means algorithm corresponds to maximizing
the likelihood of the data on the assumption that these are
generated through a mixture of spherical Gaussians.

higher levels. The question of how common such data
are, as well as the associated computational difficulties
in analyzing such higher order relations, is yet to be
explored.

[18] Bowers, P. M., Cokus, S. J., Eisenberg, D. & Yeates, T. O. (2004) Science 306, 2246-2249.

FIG. 1 Examples of clusters in three different data sets. For each cluster, a sample of five typical items is presented. All
clusters were found through the same automatic procedure.

Input:

  Pairwise similarity matrix, $s(i_1, i_2)$, $\forall\, i_1 = 1, \ldots, N$, $i_2 = 1, \ldots, N$.
  Trade-off parameter, $T$.
  Requested number of clusters, $N_c$.
  Convergence parameter, $\epsilon$.

Output:

  A (typically "soft") partition of the $N$ elements into clusters.

Initialization:

  $m = 0$.
  $P^{(m)}(C|i) \leftarrow$ A random (normalized) distribution $\forall\, i = 1, \ldots, N$.

While True

  For every $i = 1, \ldots, N$:

    • $P^{(m+1)}(C|i) \leftarrow P^{(m)}(C) \exp\left\{ \frac{1}{T} \left[ 2 s^{(m)}(C;i) - s^{(m)}(C) \right] \right\}$, $\forall\, C = 1, \ldots, N_c$.
    • $P^{(m+1)}(C|i) \leftarrow \frac{P^{(m+1)}(C|i)}{\sum_{C'=1}^{N_c} P^{(m+1)}(C'|i)}$.
    • $m \leftarrow m + 1$.

  If $\forall\, i = 1, \ldots, N$, $\forall\, C = 1, \ldots, N_c$ we have $|P^{(m+1)}(C|i) - P^{(m)}(C|i)| < \epsilon$,
    Break.

FIG. 2 Pseudo-code of the iterative algorithm for the case of pairwise relations ($r = 2$).

FIG. 3 ESR data and information relations. (A) Expression profiles of the ~ 900 genes in the yeast ESR module across the
173 microarray stress experiments (12). (B) Mutual information relations (in bits) among the ESR genes. In both panels the
genes are sorted according to the solution with 20 clusters and a relatively saturated $\langle s \rangle$. Inside each cluster, genes are sorted
according to their average mutual information relation with other cluster members.

FIG. 4 Tradeoff curves in all three applications. In every panel, each curve describes the solutions obtained for a particular
number of clusters. Different points along each curve correspond to different local maxima of $\mathcal{F}$ at different $T$ values. (A)
Tradeoff curves for the ESR data with $1/T = \{5, 10, 15, 20, 25\}$. In Fig. 6 we explore the possible hierarchical relations between
the four saturated solutions at $1/T = 25$. (B) Tradeoff curves for the S&P 500 data with $1/T = \{15, 20, 25, 30, 35\}$. (C) Tradeoff
curves for the EachMovie data with $1/T = \{20, 25, 30, 35, 40\}$.

FIG. 5 Comparison of coherence results of our approach (yellow) with conventional clustering algorithms (16): K-means
(green); K-medians (blue); Hierarchical (red). For the hierarchical algorithms, four different variants are tried: complete,
average, centroid, and single linkage, respectively from left to right. For every algorithm, three different similarity measures
are applied: Pearson correlation (left); absolute value of Pearson correlation (middle); Euclidean distance (right). The white
bars in the ESR data correspond to applying the algorithm to the $\log_2$ transformation of the expression ratios. In all cases,
the results are averaged over all the different numbers of clusters that we tried: $N_c = 5, 10, 15, 20$. For the ESR data coherence
is measured with respect to each of the three Gene Ontologies and the results are averaged.

FIG. 6 Relations between the optimal solutions with $N_c = \{5, 10, 15, 20\}$ at $1/T = 25$ for the ESR data. Every cluster is
connected to the cluster in the next (less detailed) partition that absorbs its most significant portion. The edge type indicates
the level of inclusion. The independent solutions form an approximate hierarchical structure. At the upper level the clusters
are sorted as in Fig. 3. The number above every cluster indicates the number of genes in it, and the text title corresponds
to the most enriched GO biological-process annotation in this cluster. The titles of the five clusters at the lower level are
their most enriched GO cellular-component annotation. Most clusters were enriched with more than one annotation, hence
the short titles sometimes are too concise. Red and green clusters represent clusters with a clear majority of stress-induced or
stress-repressed genes, respectively.



